ManyFold: an efficient and flexible library for training and validating protein folding models

Abstract
Summary
ManyFold is a flexible library for protein structure prediction with deep learning that (i) supports models that use both multiple sequence alignments (MSAs) and protein language model (pLM) embeddings as inputs, (ii) allows inference with existing models (AlphaFold and OpenFold), (iii) is fully trainable, allowing for both fine-tuning and the training of new models from scratch, and (iv) is written in JAX to support efficient batched operation in distributed settings. A proof-of-concept pLM-based model, pLMFold, is trained from scratch and obtains reasonable results with reduced computational overhead compared with AlphaFold.
Availability and implementation
The source code for ManyFold, the validation dataset and a small sample of training data are available at https://github.com/instadeepai/manyfold.
Supplementary information
Supplementary data are available at Bioinformatics online.

Figure: Training performance of the pLMFold model. For each training step, the total weighted loss value is shown on the left, while the values of the individual components (structure loss, distogram loss, and pLDDT loss) are shown on the right. Relative training time is also included, noting that convergence was reached in 3.5 days on the v2-128 TPUs. Also note that the total loss is computed as a sum weighted by uncropped sequence length (according to AlphaFold). All curves have been Gaussian-smoothed.

Training set
We use a procedure similar to the one put forth in [4] to collect the training set. Specifically, our training set comprises all structures of individual protein chains that can be extracted from entries in the Protein Data Bank (PDB) [2] with: (i) a release date before 2020-05-14 (2018-04-30 for the original AlphaFold models [4]), (ii) a resolution below 9 Å, and (iii) no single amino acid accounting for more than 80% of the sequence of the corresponding chain. This adds up to a total of ∼490k structures. The cutoff release date used to build the training set predates the CASP14 challenge, so that we can compare fairly with open-source models trained on data released before CASP14 took place, such as AlphaFold. Then, we use the same stochastic filters as in [4] to select which chains to train on during each pass over the full training set. Specifically: (i) protein chains are selected with probability $\frac{1}{512 \cdot C_{\text{size}}} \max(\min(N_{\text{res}}, 512), 256)$, where $N_{\text{res}}$ is the number of amino acids in the chain and $C_{\text{size}}$ is the size of the PDB cluster this protein chain falls into (clusters are derived using MMseqs2 with a 40% sequence identity cutoff [7]).
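The filtering and sampling rules above can be illustrated with a minimal Python sketch. This is not the ManyFold implementation: the `Chain` record, its field names, and the helper functions are hypothetical and introduced only for illustration. The thresholds and the selection probability are taken from the text.

```python
import random
from dataclasses import dataclass
from datetime import date

MAX_RELEASE_DATE = date(2020, 5, 14)  # chains must be released before this date
MAX_RESOLUTION = 9.0                  # angstroms
MAX_SINGLE_AA_FRACTION = 0.8          # largest allowed share of one amino acid


@dataclass
class Chain:
    """Hypothetical record for one PDB chain (field names are assumptions)."""
    sequence: str
    release_date: date
    resolution: float
    cluster_size: int  # size of the 40%-identity cluster containing the chain


def passes_static_filters(chain: Chain) -> bool:
    """Apply the three deterministic criteria (i)-(iii) described in the text."""
    if not chain.sequence:
        return False
    most_common = max(chain.sequence.count(aa) for aa in set(chain.sequence))
    return (
        chain.release_date < MAX_RELEASE_DATE
        and chain.resolution < MAX_RESOLUTION
        and most_common / len(chain.sequence) <= MAX_SINGLE_AA_FRACTION
    )


def selection_probability(n_res: int, cluster_size: int) -> float:
    """Per-pass selection probability: max(min(N_res, 512), 256) / (512 * C_size)."""
    return max(min(n_res, 512), 256) / (512 * cluster_size)


def sample_epoch(chains, rng: random.Random):
    """Draw the chains used in one pass over the training set."""
    return [
        c for c in chains
        if passes_static_filters(c)
        and rng.random() < selection_probability(len(c.sequence), c.cluster_size)
    ]
```

Under this formula, a chain of 256 residues or fewer in a singleton cluster is selected with probability 0.5, a chain of 512 residues or more in a singleton cluster is always selected, and the probability is divided by the cluster size so that large clusters are not over-represented.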

Validation sets
CAMEO
As our primary validation set, we use the targets released as part of the CAMEO competition between March 2022 and May 2022, which include samples of the three levels of difficulty (easy, medium and hard). Note that the cutoff release date used to build the training set is prior to the release date of these targets.
List of targets in our CAMEO validation set:

CASP14
As an additional validation set, we use the domain-level targets from the Free-Modeling (FM) and Template-Based Modeling hard (TBM-hard) categories of the CASP14 competition, only considering contiguous domains that are part of protein chains that have since been added to the PDB. To generate the ground-truth structures for the domains, we align their sequences to the chains found in the corresponding PDB entries and discard domains that align to no chain, or to more than one chain, with 80% sequence identity.
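The domain-to-chain assignment can be sketched as follows. This is a simplified illustration, not the procedure used to build the dataset: it assumes an ungapped sliding-window comparison (reasonable only because the domains are contiguous) in place of a real sequence aligner, and it reads the 80% criterion as "at least 80% identity". The function names are hypothetical.

```python
def best_ungapped_identity(domain: str, chain: str) -> float:
    """Highest fraction of identical residues over all ungapped placements
    of `domain` along `chain` (a simplification of a real alignment)."""
    if not domain or len(domain) > len(chain):
        return 0.0
    best = 0
    for offset in range(len(chain) - len(domain) + 1):
        # zip stops at the end of `domain`, so only the window is compared
        matches = sum(a == b for a, b in zip(domain, chain[offset:]))
        best = max(best, matches)
    return best / len(domain)


def assign_chain(domain: str, chains: dict, min_identity: float = 0.8):
    """Return the id of the unique chain matching the domain, or None when
    zero or several chains reach the identity threshold (domain discarded)."""
    hits = [
        chain_id for chain_id, seq in chains.items()
        if best_ungapped_identity(domain, seq) >= min_identity
    ]
    return hits[0] if len(hits) == 1 else None
```

Discarding ambiguous domains avoids having to choose between chains of the same entry whose coordinates may differ.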

Extended results
We obtained validation results for our pLMFold model and compared them with those of several AlphaFold models (using either MSAs or single sequences as inputs). Tables S1 and S2 include the results for the CAMEO and CASP14 datasets described above, respectively. The reported metrics are: lDDT (using all atoms or only Cα) [6], TM-score [10], GDT-TS, GDT-HA [9], and the average predicted lDDT (pLDDT) given by the models. For AlphaFold models, 'Full' refers to model_1_ptm and 'No templates' to model_5_ptm. We additionally evaluated the full OpenFold model [1], as well as the first released ESMFold [5], OmegaFold [8] and HelixFold-Single [3] models, on both datasets. As can be seen, the targets in our CASP14 dataset are difficult to predict, resulting in an overall decrease in the performance of all models. This is particularly noticeable for our pLMFold model, but also for OmegaFold and HelixFold-Single. Nevertheless, these pLM-based models still achieve considerably better results than AlphaFold using single sequences as inputs.
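To make the Cα variant of lDDT concrete, the sketch below computes a global score from two lists of Cα coordinates. It is a simplified illustration and not the evaluation code behind Tables S1 and S2. It assumes the commonly used 15 Å inclusion radius and 0.5, 1, 2 and 4 Å tolerance thresholds, pools all residue pairs rather than averaging per-residue scores, and uses strict inequalities; the function name is hypothetical.

```python
import math


def lddt_ca(pred, ref, cutoff=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Global C-alpha lDDT between predicted and reference coordinates.

    `pred` and `ref` are equal-length sequences of (x, y, z) tuples, one per
    residue. The score is superposition-free: it only compares inter-residue
    distances, so no structural alignment is required.
    """
    if len(pred) != len(ref):
        raise ValueError("pred and ref must contain the same number of residues")
    n_pairs = 0
    preserved = 0
    for i in range(len(ref)):
        for j in range(i + 1, len(ref)):
            d_ref = math.dist(ref[i], ref[j])
            if d_ref >= cutoff:
                continue  # only pairs close in the reference are scored
            n_pairs += 1
            error = abs(math.dist(pred[i], pred[j]) - d_ref)
            # count how many tolerance thresholds this distance satisfies
            preserved += sum(error < t for t in thresholds)
    if n_pairs == 0:
        return 0.0
    return preserved / (n_pairs * len(thresholds))
```

For example, three collinear residues spaced 3.8 Å apart, with the last one displaced by 1.5 Å along the line in the prediction, give a score of 2/3: one of the three distances is preserved at every threshold, and the other two only at the 2 Å and 4 Å thresholds.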